Ogun State
Culture Cartography: Mapping the Landscape of Cultural Knowledge
Ziems, Caleb, Held, William, Yu, Jane, Goldberg, Amir, Grusky, David, Yang, Diyi
To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher's goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Africa > Nigeria > Ogun State > Abeokuta (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (26 more...)
Natural language processing for African languages
Recent advances in word embeddings and language models use large-scale, unlabelled data and self-supervised learning to boost NLP performance. Multilingual models, often trained on web-sourced data like Wikipedia, face challenges: few low-resource languages are included, their data is often noisy, and lack of labeled datasets makes it hard to evaluate performance outside high-resource languages like English. In this dissertation, we focus on languages spoken in Sub-Saharan Africa where all the indigenous languages in this region can be regarded as low-resourced in terms of the availability of labelled data for NLP tasks and unlabelled data found on the web. We analyse the noise in the publicly available corpora, and curate a high-quality corpus, demonstrating that the quality of semantic representations learned in word embeddings does not only depend on the amount of data but on the quality of pre-training data. We demonstrate empirically the limitations of word embeddings, and the opportunities the multilingual pre-trained language model (PLM) offers especially for languages unseen during pre-training and low-resource scenarios. We further study how to adapt and specialize multilingual PLMs to unseen African languages using a small amount of monolingual texts. To address the under-representation of the African languages in NLP research, we developed large scale human-annotated labelled datasets for 21 African languages in two impactful NLP tasks: named entity recognition and machine translation. We conduct an extensive empirical evaluation using state-of-the-art methods across supervised, weakly-supervised, and transfer learning settings.
- Africa > Sub-Saharan Africa (0.24)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Africa > Sudan (0.14)
- (112 more...)
- Media > News (1.00)
- Information Technology (1.00)
- Government > Regional Government (1.00)
- (6 more...)
NaijaNLP: A Survey of Nigerian Low-Resource Languages
With over 500 languages in Nigeria, three languages -- Hausa, Yor\`ub\'a and Igbo -- spoken by over 175 million people, account for about 60% of the spoken languages. However, these languages are categorised as low-resource due to insufficient resources to support tasks in computational linguistics. Several research efforts and initiatives have been presented, however, a coherent understanding of the state of Natural Language Processing (NLP) - from grammatical formalisation to linguistic resources that support complex tasks such as language understanding and generation is lacking. This study presents the first comprehensive review of advancements in low-resource NLP (LR-NLP) research across the three major Nigerian languages (NaijaNLP). We quantitatively assess the available linguistic resources and identify key challenges. Although a growing body of literature addresses various NLP downstream tasks in Hausa, Igbo, and Yor\`ub\'a, only about 25.1% of the reviewed studies contribute new linguistic resources. This finding highlights a persistent reliance on repurposing existing data rather than generating novel, high-quality resources. Additionally, language-specific challenges, such as the accurate representation of diacritics, remain under-explored. To advance NaijaNLP and LR-NLP more broadly, we emphasise the need for intensified efforts in resource enrichment, comprehensive annotation, and the development of open collaborative initiatives.
- Research Report (1.00)
- Overview (1.00)
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (0.46)
- Media > News (0.46)
- (2 more...)
Bridging Gaps in Natural Language Processing for Yor\`ub\'a: A Systematic Review of a Decade of Progress and Prospects
Jimoh, Toheeb A., De Wille, Tabea, Nikolov, Nikola S.
Natural Language Processing (NLP) is becoming a dominant subset of artificial intelligence as the need to help machines understand human language looks indispensable. Several NLP applications are ubiquitous, partly due to the myriads of datasets being churned out daily through mediums like social networking sites. However, the growing development has not been evident in most African languages due to the persisting resource limitation, among other issues. Yor\`ub\'a language, a tonal and morphologically rich African language, suffers a similar fate, resulting in limited NLP usage. To encourage further research towards improving this situation, this systematic literature review aims to comprehensively analyse studies addressing NLP development for Yor\`ub\'a, identifying challenges, resources, techniques, and applications. A well-defined search string from a structured protocol was employed to search, select, and analyse 105 primary studies between 2014 and 2024 from reputable databases. The review highlights the scarcity of annotated corpora, limited availability of pre-trained language models, and linguistic challenges like tonal complexity and diacritic dependency as significant obstacles. It also revealed the prominent techniques, including rule-based methods, among others. The findings reveal a growing body of multilingual and monolingual resources, even though the field is constrained by socio-cultural factors such as code-switching and desertion of language for digital usage. This review synthesises existing research, providing a foundation for advancing NLP for Yor\`ub\'a and in African languages generally. It aims to guide future research by identifying gaps and opportunities, thereby contributing to the broader inclusion of Yor\`ub\'a and other under-resourced African languages in global NLP advancements.
- North America > United States (0.14)
- Africa > Niger (0.05)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (37 more...)
- Research Report > New Finding (1.00)
- Overview (1.00)
- Research Report > Experimental Study (0.68)
- Information Technology (0.46)
- Education (0.46)
- Media (0.45)
VLMs as GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks
Huang, Jingyuan, Huang, Jen-tse, Liu, Ziyi, Liu, Xiaoyuan, Wang, Wenxuan, Zhao, Jieyu
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, significant challenges remain, including biases and privacy concerns. To systematically address these issues in the context of geographic information recognition, we introduce a benchmark dataset consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to $53.8\%$ accuracy in city prediction, they exhibit significant regional biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed ($-12.5\%$) and sparsely populated ($-17.0\%$) areas. Moreover, the models exhibit regional biases, frequently overpredicting certain locations; for instance, they consistently predict Sydney for images taken in Australia. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
- North America > United States > California > Los Angeles County > Los Angeles (0.15)
- Asia > India > Karnataka > Bengaluru (0.14)
- North America > United States > New York (0.06)
- (74 more...)
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages
Yu, Hao, Alabi, Jesujoba O., Bukula, Andiswa, Zhuang, Jian Yun, Lee, En-Shiun Annie, Guge, Tadesse Kebede, Azime, Israel Abebe, Buzaaba, Happy, Sibanda, Blessing Kudzaishe, Kalipe, Godson K., Mukiibi, Jonathan, Kabenamualu, Salomon Kabongo, Setaka, Mmasibidi, Ndolela, Lolwethu, Odu, Nkiruka, Mabuya, Rooweither, Muhammad, Shamsuddeen Hassan, Osei, Salomey, Samb, Sokhar, Murage, Juliet W., Klakow, Dietrich, Adelani, David Ifeoluwa
Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark the fine-tuning multilingual transformer models and the prompting large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from the English language. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average performance of 26 F1-score. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuning baselines. Compared to the English language, GPT-4o and fine-tuning baselines perform similarly on intent detection, achieving an accuracy of approximately 81%. Our findings suggest that the performance of LLMs is still behind for many low-resource African languages, and more work is needed to further improve their downstream performance.
- Europe > Spain (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- (33 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Consumer Products & Services (1.00)
- (3 more...)
Benchmarking Randomized Optimization Algorithms on Binary, Permutation, and Combinatorial Problem Landscapes
Odeyemi, Jethro, Zhang, Wenjun
In this paper, we evaluate the performance of four randomized optimization algorithms: Randomized Hill Climbing (RHC), Simulated Annealing (SA), Genetic Algorithms (GA), and MIMIC (Mutual Information Maximizing Input Clustering), across three distinct types of problems: binary, permutation, and combinatorial. We systematically compare these algorithms using a set of benchmark fitness functions that highlight the specific challenges and requirements of each problem category. Our study analyzes each algorithm's effectiveness based on key performance metrics, including solution quality, convergence speed, computational cost, and robustness. Results show that while MIMIC and GA excel in producing high-quality solutions for binary and combinatorial problems, their computational demands vary significantly. RHC and SA, while computationally less expensive, demonstrate limited performance in complex problem landscapes. The findings offer valuable insights into the trade-offs between different optimization strategies and provide practical guidance for selecting the appropriate algorithm based on the type of problems, accuracy requirements, and computational constraints.
- Europe > Netherlands > South Holland > Delft (0.05)
- North America > Canada > Saskatchewan > Saskatoon (0.04)
- North America > United States (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)
Human Motion Synthesis_ A Diffusion Approach for Motion Stitching and In-Betweening
Adewole, Michael, Giwa, Oluwaseyi, Nerrise, Favour, Osifeko, Martins, Oyedeji, Ajibola
Human motion generation is an important area of research in many fields. In this work, we tackle the problem of motion stitching and in-betweening. Current methods either require manual efforts, or are incapable of handling longer sequences. To address these challenges, we propose a diffusion model with a transformer-based denoiser to generate realistic human motion. Our method demonstrated strong performance in generating in-betweening sequences, transforming a variable number of input poses into smooth and realistic motion sequences consisting of 75 frames at 15 fps, resulting in a total duration of 5 seconds. We present the performance evaluation of our method using quantitative metrics such as Frechet Inception Distance (FID), Diversity, and Multimodality, along with visual assessments of the generated outputs.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Singapore (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- (3 more...)
Model Of Information System Towards Harmonized Industry And Computer Science
Faith, Edafetanure-Ibeh, Tamarauefiye, Evah Patrick, Uyi, Mark Uwuoruya
The aim of attending an educational institution is learning, which in turn is sought after for the reason of independence of thoughts, ideologies as well as physical and material independence. This physical and material independence is gotten from working in the industry, that is, being a part of the independent working population of the country. There needs to be a way by which students upon graduation can easily adapt to the real world with necessary skills and knowledge required. This problem has been a challenge in some computer science departments, which after effects known after the student begins to work in an industry. The objectives of this project include: Designing a web based chat application for the industry and computer science department, Develop a web based chat application for the industry and computer science and Evaluate the web based chat application for the industry and computer science department. Waterfall system development lifecycle is used in establishing a system project plan, because it gives an overall list of processes and sub-processes required in developing a system. The descriptive research method applied in this project is documentary analysis of previous articles. The result of the project is the design, software a web-based chat application that aids communication between the industry and the computer science department and the evaluation of the system. The application is able to store this information which can be decided to be used later. Awareness of the software to companies and universities, implementation of the suggestions made by the industry in the computer science curriculum, use of this software in universities across Nigeria and use of this not just in the computer science field but in other field of study
- North America > Canada (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom (0.04)
- (4 more...)
- Information Technology (1.00)
- Education > Educational Setting > Higher Education (0.46)
Corn Yield Prediction Model with Deep Neural Networks for Smallholder Farmer Decision Support System
Olisah, Chollette, Smith, Lyndon, Smith, Melvyn, Morolake, Lawrence, Ojukwu, Osi
Given the nonlinearity of the interaction between weather and soil variables, a novel deep neural network regressor (DNNR) was carefully designed with considerations to the depth, number of neurons of the hidden layers, and the hyperparameters with their optimizations. Additionally, a new metric, the average of absolute root squared error (ARSE) was proposed to address the shortcomings of root mean square error (RMSE) and mean absolute error (MAE) while combining their strengths. Using the ARSE metric, the random forest regressor (RFR) and the extreme gradient boosting regressor (XGBR), were compared with DNNR. The RFR and XGBR achieved yield errors of 0.0000294 t/ha, and 0.000792 t/ha, respectively, compared to the DNNR(s) which achieved 0.0146 t/ha and 0.0209 t/ha, respectively. All errors were impressively small. However, with changes to the explanatory variables to ensure generalizability to unforeseen data, DNNR(s) performed best. The unforeseen data, different from unseen data, is coined to represent sudden and unexplainable change to weather and soil variables due to climate change. Further analysis reveals that a strong interaction does exist between weather and soil variables. Using precipitation and silt, which are strong-negatively and strong-positively correlated with yield, respectively, yield was observed to increase when precipitation was reduced and silt increased, and vice-versa.
- Africa > Nigeria > Enugu State > Enugu (0.05)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.05)
- Africa > Nigeria > Plateau State (0.04)
- (7 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)